Statistical Mechanics of Support Vector Networks

Using methods of Statistical Physics, we investigate the generalization performance of support
vector machines (SVMs), which have been recently introduced as a general alternative to neural
networks. For nonlinear classification rules, the generalization error saturates on a plateau when
the number of examples is too small to properly estimate the coefficients of the nonlinear part.
When trained on simple rules, we find that SVMs overfit only weakly. The performance of SVMs is
strongly enhanced when the distribution of the inputs has a gap in feature space.

Statistical Mechanics provides an important approach to analyzing and understanding the ability
of neural networks to learn and generalize from examples. The majority of this work has been
devoted to the simplest network architecture, the perceptron. This network, however, has limited
power, because it classifies examples with a simple linear separating hyperplane and is able to
learn only linearly separable rules. More complicated multilayer neural nets can realize general
nonlinear rules (when the size of their hidden layer is large enough) but also have practical and
theoretical disadvantages. Learning in these networks usually results in a nonconvex optimization
problem, and there is no guarantee that an algorithm will find the minimum of the training cost
function. The complexity of the training error surface is reflected in the theoretical analysis by
Statistical Mechanics: the occurrence of phases of broken ergodicity makes their analysis a
complicated task. Finally, network parameters must be chosen carefully in order to adapt the
network's complexity to the task and to avoid overfitting.

Recently, a new type of learning machine has been introduced by V. Vapnik and his collaborators,
which may become a reasonable alternative to neural networks. These support vector machines
(SVMs) seem to have several advantages over neural networks. Being generalizations of
perceptrons, their training involves only simple convex optimization. Further, for several
applications, it has been shown that SVMs do not have a strong tendency to overfit.

In this letter, we present a detailed analysis of the typical performance of SVMs by methods of
Statistical Mechanics. To understand the basic idea behind the SVM approach, assume a nonlinear
mapping $\Psi(x)$ from vectors $x \in \mathbb{R}^N$ onto vectors $\Psi$ which belong to an
$M$-dimensional feature space. A nonlinear classification of inputs $x$ can be defined by a linear
separation of feature vectors $\Psi(x)$ using a perceptron with weight vector
$w \in \mathbb{R}^M$ perpendicular to the separating hyperplane via
$\mathrm{sign}(\Psi(x) \cdot w)$. The dot denotes the standard inner product of vectors in
$\mathbb{R}^M$. The vector $w$ can be adapted to a set of example data by any learning algorithm
for perceptrons. This simple approach has major problems which result from the typically high
dimensionality of the feature space. Assuming, e.g., that the vector $\Psi$ contains all bilinear
expressions of components of the input vector $x$ (in addition to linear ones), the dimension $M$
is of order $N^2$. First, there is a big computational problem in storing and learning the weights
and, second, one can expect that there is also a large tendency of these machines to overfit,
because there are far fewer training data than adjustable parameters in this model. The main idea
to overcome these problems is to use the optimal stability learning algorithm, which has also been
studied extensively in the Statistical Mechanics approach to neural networks. The goal of this
algorithm is to find a vector of weights $w$ which allows for a separation of positive and negative
example points with the maximal margin defined by

$$\max_{w}\, \min_{\mu}\, \left\{ h_\mu / \sqrt{w \cdot w} \right\}. \qquad (1)$$

The local fields $h_\mu$ are given by

$$h_\mu = \sigma_\mu\, w \cdot \Psi(x^\mu). \qquad (2)$$



Here, $\sigma_\mu \in \{-1, 1\}$ is the classification of the point $x^\mu$, for
$\mu = 1, \ldots, m$, and $m$ is the total number of labelled examples in the training set. This
maximization problem is found to be equivalent to a quadratic minimization problem for the
function $\frac{1}{2} w \cdot w$ under the constraints that $h_\mu \geq 1$ for all examples in the
training set. According to convex optimization theory the solution vector can be expanded as a
linear combination of example feature vectors via

$$w = \sum_{\mu} \alpha_\mu \sigma_\mu \Psi(x^\mu), \qquad (3)$$

where $\alpha_\mu \geq 0$ are Lagrange parameters which account for the $m$ inequality
constraints. Hence, the number of adjustable parameters for this algorithm never exceeds the
number of examples. The $\alpha_\mu$ are nonzero only for those examples for which $h_\mu = 1$,
defining the support vectors (SVs) of the data set. If the remaining examples ($\alpha_\mu = 0$)
were discarded from the training set, the SVM would still predict their correct label
$\sigma_\mu$. Hence, if the relative number of SVs is small, we can expect that the SVM
generalizes well. In fact, a simple argument shows that the expected ratio of the number of
support vectors over $m$ yields an upper bound on the generalization error. We will see later
within the average case scenario of Statistical Mechanics that this mechanism prevents a complex
SVM from overfitting when learning a simple rule.
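
As a minimal sketch of the optimal stability problem described above, the following Python code treats the plain perceptron case $\Psi(x) = x$ and updates the Lagrange parameters one at a time (a coordinate ascent of the Adatron/Hildreth type, which is an assumption of this sketch and not necessarily the procedure used by the authors). The toy data set, the teacher vector `B`, the random seed and the function name `train_max_margin` are hypothetical choices made only for illustration.

```python
import random


def train_max_margin(X, sigma, sweeps=5000):
    """Minimize 1/2 w.w subject to h_mu >= 1 (no bias term) by coordinate ascent."""
    m, n = len(X), len(X[0])
    alpha = [0.0] * m
    w = [0.0] * n  # w = sum_mu alpha_mu sigma_mu x_mu, as in Eq. (3)
    for _ in range(sweeps):
        for mu in range(m):
            h = sigma[mu] * sum(wi * xi for wi, xi in zip(w, X[mu]))
            norm2 = sum(xi * xi for xi in X[mu])
            # Move alpha_mu towards h_mu = 1, but keep it non-negative.
            new_alpha = max(0.0, alpha[mu] + (1.0 - h) / norm2)
            delta = new_alpha - alpha[mu]
            if delta != 0.0:
                for i in range(n):
                    w[i] += delta * sigma[mu] * X[mu][i]
                alpha[mu] = new_alpha
    return w, alpha


# Hypothetical toy problem: binary inputs labelled by a linear teacher B.
rng = random.Random(1)
N, m = 7, 12  # N odd, so B.x is never zero
B = [rng.choice([-1, 1]) for _ in range(N)]
X = [[rng.choice([-1, 1]) for _ in range(N)] for _ in range(m)]
sigma = [1 if sum(b * x for b, x in zip(B, xm)) > 0 else -1 for xm in X]

w, alpha = train_max_margin(X, sigma)
fields = [s * sum(wi * xi for wi, xi in zip(w, xm)) for xm, s in zip(X, sigma)]
support = [mu for mu in range(m) if alpha[mu] > 1e-8]
print("smallest local field:", min(fields))
print("number of support vectors:", len(support), "out of", m)
```

After training, every local field should satisfy $h_\mu \geq 1$ up to numerical tolerance, and the examples with $\alpha_\mu > 0$ (the support vectors) should sit on the margin $h_\mu = 1$.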

The expansion (3) also reduces the computational cost of the algorithm drastically because any
inner product of $w$ with vectors $\Psi(x)$ in the feature space (including $w \cdot w$) is
entirely expressed in terms of the so-called kernel
$K(x, y) = \Psi(x) \cdot \Psi(y) = \sum_\rho \Psi_\rho(x) \Psi_\rho(y)$. In particular, for any
$x$, we have

$$w \cdot \Psi(x) = \sum_{\mu} \sigma_\mu \alpha_\mu K(x, x^\mu). \qquad (4)$$

Hence, both learning and prediction on novel inputs depend on the feature vectors $\Psi$ only
through the kernel $K$. In fact, there is no need to specify the high dimensional mapping
$\Psi(\cdot)$ explicitly. Instead, one can directly take any reasonable positive semidefinite
operator kernel $K$, which by Mercer's theorem has a decomposition
$K(x, y) = \sum_\rho \lambda_\rho \phi_\rho(x) \phi_\rho(y)$ in terms of eigenvalues
$\lambda_\rho$ and orthonormal eigenfunctions $\phi_\rho(x)$, and identify $\Psi_\rho$ with
$\sqrt{\lambda_\rho}\, \phi_\rho$. This approach even allows one to take kernels with feature
space dimension $M = \infty$ without problems.
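
The identity in Eq. (4) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the purely quadratic kernel $K(x, y) = (x \cdot y)^2$, whose explicit feature map is the list of all ordered products $x_i x_j$, and the examples, labels and coefficients `alphas` are arbitrary hypothetical values rather than the solution of an actual SVM training run.

```python
import itertools
import math
import random


def quadratic_kernel(x, y):
    """K(x, y) = (x.y)^2, evaluated without building feature vectors."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def explicit_features(x):
    """All ordered products x_i * x_j: an explicit N^2-dimensional Psi(x)."""
    return [a * b for a, b in itertools.product(x, repeat=2)]


def decision(x, examples, labels, alphas, kernel):
    """Right-hand side of Eq. (4): sum_mu sigma_mu alpha_mu K(x, x_mu)."""
    return sum(s * a * kernel(x, xm) for xm, s, a in zip(examples, labels, alphas))


rng = random.Random(0)
N, m = 4, 3
examples = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(m)]
labels = [1, -1, 1]
alphas = [0.5, 1.0, 0.0]  # arbitrary non-negative coefficients (hypothetical)
x = [rng.gauss(0, 1) for _ in range(N)]

# Explicit route: build w from Eq. (3), then take the inner product with Psi(x).
w = [0.0] * (N * N)
for xm, s, a in zip(examples, labels, alphas):
    for idx, f in enumerate(explicit_features(xm)):
        w[idx] += a * s * f
explicit_value = sum(wk * fk for wk, fk in zip(w, explicit_features(x)))

# Kernel route: never touches the N^2-dimensional feature space.
kernel_value = decision(x, examples, labels, alphas, quadratic_kernel)
print(explicit_value, kernel_value)
```

Both routes give the same number up to rounding, but the kernel route costs $O(mN)$ operations instead of $O(mN^2)$.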

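We will now study the generalization performance of SVMs within the framework of Statistical
Mechanics. We define the partition function

$$Z = \int dw\, \exp\left[-\tfrac{1}{2} \beta\, w \cdot w\right] \prod_{\mu=1}^{m} \Theta(h_\mu - 1), \qquad (5)$$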

which for $\beta \to \infty$ is dominated by the solution vector $w$ of the SVM algorithm. The
properties of the SVM can be computed from the average free energy
$F = -\frac{1}{\beta} \langle\langle \ln Z \rangle\rangle$ in the zero temperature limit
$\beta \to \infty$, where the double brackets denote the average over the distribution of $m$
training examples. The main difference from the Statistical Mechanics of learning in a simple
perceptron with $M$ weights is that in the SVM, each coupling $w_\rho$ is weighted by
$\sqrt{\lambda_\rho}$, which typically diminishes the influence of the more complex, higher order
degrees of freedom in the eigenvector expansion. As we will see, this makes the generalization
behavior of the SVM rather different from that of a simple perceptron in the thermodynamic limit
$N \to \infty$, when the rule to be learnt has a similar eigenvector expansion. We will first
consider here a rule of the form
$\sigma_\mu = \mathrm{sign}\left(\sum_\rho \sqrt{\lambda_\rho}\, B_\rho\, \phi_\rho(x^\mu)\right)$,
where the teacher weight vector is given by $B_\rho = \pm 1$. We will further average the
performance over all teachers of this form with equal probability for all nonzero components. We
will specialize on a family of kernels of the form $K(x, y) = k\left(\frac{x \cdot y}{N}\right)$,
where the only constraint on the function $k(\cdot)$ is the non-negativity of the eigenvalues.
These kernels are permutation symmetric in the components of the input vectors and contain the
simple perceptron as a special case, when $k$ is a linear function. This choice has the nice
feature that for binary input vectors $x \in \{-1, 1\}^N$ the eigenvalue decomposition of
$K(x, y)$ can be explicitly calculated. The eigenfunctions are labelled by subsets
$\rho \subseteq \{1, \ldots, N\}$. We have $\phi_\rho(x) = 2^{-N/2} \prod_{i \in \rho} x_i$. The
eigenvalues are $\lambda_\rho = 2^{N/2} \sum_x K(e, x)\, \phi_\rho(x)$, where
$e = (1, \ldots, 1)^T$. They depend on the cardinality $|\rho|$ only and show for large $N$ an
exponential decay with $|\rho|$ like
$\lambda_\rho \simeq \frac{2^N}{N^{|\rho|}} k^{(|\rho|)}(0)$. The corresponding degeneracy grows
exponentially: $n_{|\rho|} = \binom{N}{|\rho|} \simeq N^{|\rho|} / |\rho|!$.
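
For small $N$ the eigenvalue decomposition can be verified by brute force. The following sketch assumes $N = 4$ and the quadratic kernel function $k(t) = (1 - d)t^2 + dt$ with $d = 0.5$ (both are illustrative choices); it enumerates all $2^N$ binary inputs, evaluates $\lambda_\rho$ from the formula above, and checks the eigenvalue equation $\sum_x K(y, x) \phi_\rho(x) = \lambda_\rho \phi_\rho(y)$ for every subset $\rho$ and every input $y$.

```python
import itertools

N, d = 4, 0.5  # illustrative values, small enough for full enumeration


def k(t):
    """Kernel function k(t) = (1 - d) t^2 + d t."""
    return (1 - d) * t * t + d * t


def K(x, y):
    """Kernel K(x, y) = k(x.y / N)."""
    return k(sum(a * b for a, b in zip(x, y)) / N)


def phi(rho, x):
    """Eigenfunction phi_rho(x) = 2^(-N/2) * prod_{i in rho} x_i."""
    prod = 1
    for i in rho:
        prod *= x[i]
    return 2 ** (-N / 2) * prod


inputs = list(itertools.product([-1, 1], repeat=N))
e = (1,) * N
subsets = [rho for r in range(N + 1) for rho in itertools.combinations(range(N), r)]

# lambda_rho = 2^(N/2) * sum_x K(e, x) phi_rho(x)
lam = {rho: 2 ** (N / 2) * sum(K(e, x) * phi(rho, x) for x in inputs) for rho in subsets}

# Largest violation of the eigenvalue equation over all subsets and inputs.
max_residual = max(
    abs(sum(K(y, x) * phi(rho, x) for x in inputs) - lam[rho] * phi(rho, y))
    for rho in subsets
    for y in inputs
)

# Rescaled eigenvalues Lambda = lambda / 2^N, one representative per cardinality.
Lambda_by_order = {r: lam[tuple(range(r))] / 2 ** N for r in range(N + 1)}
same_order = all(abs(lam[rho] - lam[tuple(range(len(rho)))]) < 1e-12 for rho in subsets)
total = sum(lam.values()) / 2 ** N

print("max residual:", max_residual)
print("Lambda by |rho|:", Lambda_by_order)
print("sum of Lambda:", total, " k(1):", k(1.0))
```

For this quadratic kernel the leading-order expression $2^N k^{(|\rho|)}(0) / N^{|\rho|}$ is already exact for $|\rho| = 1$ and $|\rho| = 2$, and the rescaled eigenvalues $\lambda_\rho / 2^N$ sum to $k(1)$.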

We expect that a decay of the generalization error, $\epsilon_g$, to zero should occur only on the
scale of $m = O(M)$, since $M$ is the number of learnable parameters. However, as we will show,
$\epsilon_g$ may drop to small values already on a scale of $m = \alpha N$ examples. Hence, we
make the general ansatz $m = \alpha N^l$, $l \in \mathbb{N}$, and calculate
$f = \lim_{\beta \to \infty} \lim_{N \to \infty} N^{-l} F$.

If we assume that the inputs $x^\mu$ are drawn at random with respect to a uniform probability
distribution $D(x)$ on $\{-1, 1\}^N$, we can perform the average over the input distribution by
the replica method. This becomes tractable by the fact that the eigenfunctions are orthonormal
with respect to $D(x)$ and we have
$2^N \langle \phi_\rho(x) \phi_{\rho'}(x) \rangle_D = \sum_x \phi_\rho(x) \phi_{\rho'}(x) = \delta_{\rho \rho'}$.
Furthermore, all but the constant eigenfunctions have zero mean under the uniform distribution.
By restricting the kernels to having $k(0) = 0$, the average over the inputs is expressed in the
thermodynamic limit $N \to \infty$ by expectations over Gaussian random variables. These averages
can be further expressed by the order parameters

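$$q_0 = \sum_\rho \Lambda_\rho \langle w_\rho^2 \rangle, \qquad
q = \sum_\rho \Lambda_\rho \langle w_\rho \rangle^2, \qquad
R = \sum_\rho \Lambda_\rho \langle w_\rho \rangle B_\rho,$$

where $\Lambda_\rho = \lambda_\rho / 2^N$, and $\langle \ldots \rangle$ denotes a statistical
mechanical averaging specified by Eq. (5). The generalization error is
$\epsilon_g = \frac{1}{\pi} \arccos \frac{R}{\sqrt{Bq}}$, where
$B = \sum_\rho \Lambda_\rho = k(1)$ is the squared norm of the teacher vector. In replica
symmetry (which is expected to be exactly fulfilled by the convexity of the phase space) we obtain
$f$ by extremizing the replica-symmetric free energy function with respect to the order
parameters $q$, $R$ and $\chi$. Further, $Dt = \frac{dt}{\sqrt{2\pi}} e^{-t^2/2}$,
$\Phi(x) = \int_{-\infty}^{x} Dt$, and $\Lambda^{(+)} = \sum_{|\rho| > l} \Lambda_\rho$ denotes
the sum over the higher order components and $\Lambda^{(-)} = \sum_{|\rho| < l} \Lambda_\rho$.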

As a general result of solving the order parameter equations we find that all high order
components $|\rho| > l$ of the teacher vector are completely undetermined by learning only
$O(N^l)$ examples, in the sense that
$\sum_{|\rho| > l} \Lambda_\rho \langle w_\rho \rangle B_\rho = 0$, and also that
$\sum_{|\rho| > l} \Lambda_\rho \langle w_\rho \rangle^2 = 0$, in the large $N$ limit. However, as
we will see, the values of the corresponding weights $w_\rho$ are not zero but are determined by
the expansion (3). On the other hand, all lower order components are completely determined, in
the sense that $w_\rho = cB_\rho$ for all $|\rho| < l$, where $c$ depends on $\alpha$ only. The
only components which are actually learnt at a scale $l$ are those for $|\rho| = l$. We will
illustrate these results for quadratic kernels of the form $k(x) = (1 - d)x^2 + dx$, where the
parameter $d$, $0 < d < 1$, tunes the degree of nonlinearity in the SVM's decision boundary. On a
scale of $m = \alpha N$ examples (left side of Fig. 1), the SVM is able to learn the linear part
of the teacher's rule. However, since there is not enough information to infer the remaining
$O(N^2)$ weights of the teacher's quadratic part, the generalization error of the SVM reaches a
nonzero plateau with $\epsilon_g(\alpha) - \epsilon_g(\infty) \sim \alpha^{-1}$, where
$\epsilon_g(\infty) = \pi^{-1} \arccos \sqrt{d}$.



This scaling may be understood from the fact that the undetermined components $w_\rho$ and
$B_\rho$ with $|\rho| = 2$ act as a noise term during classification, similar to learning of
perceptrons with weight noise. For comparison, we also show the performance of a simple linear
SVM (i.e. a perceptron) for which $w_\rho = 0$ when $|\rho| > 1$. The better performance of the
nonlinear SVM does not contradict the fact that, on the linear scale, its higher order weights
$w_\rho$ for $|\rho| = 2$ are uncorrelated with the corresponding teacher values. Those weights
are needed to learn the training examples perfectly, which is not possible for the linear machine
when $\alpha$ exceeds a critical value $\alpha_c(d)$, given by
$\pi / \alpha_c = \arctan\left(\pi / (\alpha_c d)\right)$.
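
The two closed-form results for the quadratic kernel on the linear scale are easy to evaluate numerically. The sketch below computes the plateau height $\pi^{-1} \arccos \sqrt{d}$ and solves the implicit equation $\pi / \alpha_c = \arctan(\pi / (\alpha_c d))$ by bisection. The function names and the bracketing interval are assumptions of this sketch; the lower bracket $\alpha = 2$ is used because the equation gives $\alpha_c \to 2$ for $d \to 0$ and larger values otherwise.

```python
import math


def plateau(d):
    """Plateau of the generalization error on the linear scale: arccos(sqrt(d)) / pi."""
    return math.acos(math.sqrt(d)) / math.pi


def residual(alpha, d):
    """Left minus right side of pi/alpha_c = arctan(pi / (alpha_c d))."""
    return math.pi / alpha - math.atan(math.pi / (alpha * d))


def alpha_c(d, iterations=200):
    """Critical alpha of the linear machine for 0 < d < 1, found by bisection."""
    lo, hi = 2.0, 1e6  # residual is positive at lo and negative at hi
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid, d) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for d in (0.1, 0.5, 0.9):
    print(f"d = {d}: plateau = {plateau(d):.4f}, alpha_c = {alpha_c(d):.4f}")
```

For $d = 0.5$ the plateau is $\arccos(\sqrt{0.5}) / \pi = 1/4$, and $\alpha_c$ grows without bound as $d \to 1$, where the teacher becomes purely linear and the linear machine can always fit the training set.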

Increasing the number of examples to a scale of $m = \alpha N^2$ (right side of Fig. 1), the well
known $1/\alpha$ asymptotic vanishing of $\epsilon_g$ is found. A similar stepwise learning has
been obtained for the case of Gibbs learning in higher order perceptrons. In general, for kernels
which are polynomials of order $z$, more plateaus will appear. On the scale of $m = \alpha N^l$
examples, the generalization error decays to a plateau as $\alpha \to \infty$.




[Figure 1: left and right panels; horizontal axis label: m/N]

FIG. 1. Decrease of the generalization error on different scales of examples, for quadratic
SVM-kernel learning a quadratic teacher rule ($d = 0.5$, $B = 1$) and various gaps $\gamma$. The
inset compares the SVM to a linear perceptron (upper curve) trying to learn the same task.
Simulations were performed with $N = 201$ and averaged over 50 runs (left and next figure), and
$N = 20$, 40 runs (right).

Finally, at the highest scale $m = \alpha N^z$, the generalization error converges to zero as
$\epsilon_g \sim \alpha^{-1}$. This form is in accordance with general results which show that (in
the worst case) the number of examples must be larger than the capacity of the classifier in
order to achieve a small generalization error. The capacity $m_c = \alpha_c N^z$ is found from the
same free energy by solving the order parameter equations with the restriction $R = 0$, as the
value of $\alpha$ where $q_0$ diverges. We obtain $\alpha_c = 2/z!$, which agrees with the results
in [10] for polynomial separation surfaces in the large $N$ limit.

As the next problem, we study the ability of the SVM to cope with the problem of overfitting when
learning a simple rule. We keep the SVM quadratic, but choose a simpler, linear teacher rule
according to $|B_\rho| = 1$ for $|\rho| = 1$ and $|B_\rho| = 0$ otherwise. The results for the
generalization error, obtained by a straightforward extension of the free energy calculation, are
shown in Fig. 2, where the number of examples is scaled as $m = \alpha N$. Surprisingly, although
the student has $O(N^2)$ adjustable parameters, this does not lead to any strong overfitting. The
SVM is able to learn the $N$ teacher weights on the scale of $m = \alpha N$ examples, far below
capacity. For comparison, we have also shown $\epsilon_g$ for a simple linear SVM (i.e. with
$w_\rho = 0$ for $|\rho| = 2$). While for the latter case the decay of the generalization error is
of the well known form $\epsilon_g \sim \alpha^{-1}$, the quadratic SVM shows the somewhat slower
decay $\epsilon_g \sim \alpha^{-2/3}$. The same scaling is obtained for higher order SVMs which
learn a low order, e.g. a linear, rule.




[Figure 2: horizontal axis label: m/N]

FIG. 2. Learning curves for linear student and quadratic SVM-kernels, all learning a linear
teacher rule ($B = d$). For $\alpha = 10$, a finite size scaling is shown as inset.

We can shed further light on this interesting result by showing that the number of SVs increases
like $\alpha^{2/3}$, hence the relative number of SVs (which is a crude upper bound on
$\epsilon_g$) decreases like $\alpha^{-1/3}$. This can be understood from the following analysis,
which is valid for more general classes of input distributions. For simplicity, we restrict
ourselves to the quadratic SVM learning a linear rule. We assume that the inputs have zero mean
and are sufficiently weakly correlated such that the off-diagonal elements of the quadratic part
of the kernel matrix $K_{\mu\nu}$ are typically $O(1/N)$. The diagonal elements are
$K_{\mu\mu} = 1 - d$. Evaluating $h_\mu = \sigma_\mu \Psi(x^\mu) \cdot w$ using Eq. (4), one finds
that the relative contributions of the off-diagonal elements of $K$ can be neglected on the linear
scale $m = \alpha N$. Hence we obtain $h_\mu = v_\mu + (1 - d)\alpha_\mu$, with $v_\mu$ being the
contribution from the linear weights, namely $v_\mu = \sigma_\mu \sqrt{d/N}\, w \cdot x^\mu$,
where $w$ consists only of $w_\rho$ with $|\rho| = 1$. Solving for the coefficients $\alpha_\mu$,
noting that they are nonzero only when $h_\mu = 1$, we obtain

$$\alpha_\mu = (1 - d)^{-1} (1 - v_\mu)\, \Theta(1 - v_\mu). \qquad (8)$$

When $\alpha$ is small, all $\alpha_\mu \approx 1/(1 - d)$ and the SVM acts like a Hebbian
classifier. With increasing number of examples $v_\mu$ will grow and the probability that
$\alpha_\mu > 0$ (an example is a SV) will decrease. The exact asymptotic scaling can be
calculated self-consistently, assuming that for large $\alpha$, $w_\rho \simeq cB_\rho$ for
$|\rho| = 1$ and $c \gg 1$. Using Eq. (8) and noting that $v_\mu \simeq c|u_\mu|$, where $u_\mu$
is the linear contribution to the local field of the teacher vector, we obtain a self-consistent
equation for $c$, valid for large $\alpha$. Here $p(u)$ denotes the density of the teacher linear
fields $u$. Solving this equation for $c$ in the limit $\alpha \to \infty$ yields
$c \sim (\alpha p(0)/6)^{1/3}$. Similarly, the relative number of SVs scales as
$p(0)/c \sim \alpha^{-1/3} p(0)^{2/3}$.
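
The $\alpha^{1/3}$ scaling can be illustrated numerically. The sketch below is not the equation of the original text, which is not reproduced here. It assumes one possible self-consistency condition, obtained by projecting the expansion (3) onto the teacher's linear weights and inserting Eq. (8) with $v_\mu = c|u_\mu|$, namely $c = \alpha\, \mathrm{E}\left[\alpha_\mu |u_\mu|\right]$. It further assumes Gaussian teacher fields with variance $d$. The numerical prefactors therefore depend on these normalization choices, and only the exponents should be compared with the text.

```python
import math


def teacher_field_density(u, d):
    """Assumed Gaussian density p(u) of the teacher's linear field, variance d."""
    return math.exp(-u * u / (2 * d)) / math.sqrt(2 * math.pi * d)


def mean_alpha_times_field(c, d, steps=2000):
    """E[alpha_mu |u|] with alpha_mu from Eq. (8) and v_mu = c|u| (midpoint rule)."""
    # The integrand vanishes for |u| > 1/c; substitute u = t / c with 0 < t < 1.
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        total += teacher_field_density(t / c, d) * t * (1 - t)
    integral = total / steps
    return 2 * integral / (c * c * (1 - d))


def solve_c(alpha, d, iterations=100):
    """Solve c = alpha * E[alpha_mu |u|] for c by bisection on a log scale."""
    lo, hi = 1e-3, 1e6
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if mid - alpha * mean_alpha_times_field(mid, d) > 0:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)


d = 0.5
c_low = solve_c(1000.0, d)
c_high = solve_c(8000.0, d)
# Large-alpha limit of the assumed condition: c^3 = alpha p(0) / (3 (1 - d)).
c_asymptotic = (8000.0 * teacher_field_density(0.0, d) / (3 * (1 - d))) ** (1 / 3)
# Fraction of examples with c|u| < 1, i.e. with alpha_mu > 0 (support vectors).
sv_fraction = math.erf(1 / (c_high * math.sqrt(2 * d)))

print("c(8000) / c(1000):", c_high / c_low)
print("c(8000) vs asymptotic form:", c_high, c_asymptotic)
print("SV fraction at alpha = 8000:", sv_fraction)
```

Increasing $\alpha$ by a factor of 8 roughly doubles $c$, as expected for $c \sim \alpha^{1/3}$, and the fraction of support vectors approaches $2p(0)/c$, which decays like $\alpha^{-1/3}$.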

The dependence on $p(0)$ suggests that the density of inputs at the teacher's decision boundary
should play a crucial role for the generalization ability of the SVM. When this density vanishes
close to the teacher's separating hypersurface, a much faster decay of the generalization error
can be expected. To study this property in more detail, we have analyzed the Statistical
Mechanics for an input distribution correlated with the teacher weights such that
$D(x) \sim \Theta\left(\left|\sum_\rho \sqrt{\lambda_\rho}\, B_\rho\, \phi_\rho(x)\right| - \gamma\right)$,
which has a gap of zero density with size $2\gamma$ around the teacher's decision boundary. As
expected, the generalization performance of a quadratic SVM which learns from a quadratic teacher
is enhanced, but the asymptotic decay towards the plateau on the linear scale (see Fig. 1) is
still of the form $\epsilon_g(\alpha) - \epsilon_g(\infty) \sim \alpha^{-1}$. The effect of
the gap is more dramatic on the highest scale $m = \alpha N^2$, where instead of an inverse power
law, we now find a fast drop of the generalization error like
$\epsilon_g \sim \alpha^{-3} e^{-\hat{c}(\gamma) \alpha^2}$.